Abstract

In the past, mechanization has been limited to those information 
centers that have access to extensive computer facilities. Now, 
small/low cost computers are available with I/O capacities that make 
them suitable for SDI and retrospective searching on any of the many 
commercially available data bases. A small two-tape computer system 
is assumed, and an analysis of its run-time equations leads to a three- 
step search procedure. Run times and costs are shown as a function of 
file size, number of search terms, and input transmission rates. Actual 
examples verify that it is economically feasible for an information 
center to consider its own small, dedicated computer system. 

* This research was sponsored in part by the National Aeronautics and
Space Administration Contract NASW-2307.
Introduction


Until recently, information center mechanization has been limited 
to those centers that have access to extensive computer facilities. 
Although the amount of machine time that a center uses may be small, 
the center is dependent upon the services of highly specialized computer 
personnel. Consequently, computerization has occurred in large centers 
that can afford their own computer system or those centers that have 
convenient access to someone else’s machine. 

By now, the small computer has gained wide acceptance and its
use for special applications has become commonplace. For example,
Warheit (1) suggests the IBM System/3 as a small, general purpose computer
which can perform a great variety of library jobs, such as controlling
book circulation and producing overdue notices. Furthermore, he feels
that it is cheap enough so that many libraries which formerly could not
afford computer services now have the opportunity to have their own
system (2).

IBM has recently announced that inexpensive tape drives will soon 
be available for its System/3. This means that this machine can be used 
by information centers to process externally produced data bases that are 
distributed on tape, such as the Library of Congress MARC file. In 
addition, it will be possible to use this system for SDI and retrospective 
searching on any of the many commercially available data bases. Let us 
now consider how this might be done with a small/low cost computer. 
The Computer System

At this point, let us assume that the computer budget is limited 
and that hardware costs are to be kept low. Therefore, let us begin 
with the small system described by Warheit and add two tape drives. 
Now, we have a machine that includes a small CPU with limited memory, 
a card input/output device, a printer, and two tape drives. This 
minimal configuration in a System/3 will rent for approximately 
$2,600.00 to $2,900.00 per month when tapes are available. This same
configuration is currently available in an IBM 1130 and rents for 
approximately $2,700.00 to $3,400.00 per month. Both include a small 
disk for storage of a monitor and user programs. 

Search Procedure 


With such a small machine, random access searching on inverted 
files is out of the question. This is particularly true for large data 
bases, such as the American Society for Metals (ASM) file of 125,000 
documents or the National Aeronautics and Space Administration (NASA) 
file of 650,000. Furthermore, file inversion on such a limited machine
would be relatively difficult since the system includes just two tapes 
and one small disk. Consequently, both search and file maintenance 
procedures dictate that data files be organized in a linear mode. This 
requires that each and every document be examined during each search. 
Needless to say, this may result in long search runs. 
Appendix 1 shows equations that can be used to predict linear search
times. Analysis of these equations shows that run time is a function of 
three sets of parameters. The first set is dictated by the data file 
and includes the number of data bank documents and the average number 
of index terms per document. The second set is determined by the 
computer and includes average instruction execution time, start time of 
the input tape drive, and time to transmit one input character from the 
drive to memory. The third set is specified by the information center 
and includes the average number of characters per input document and the 
number of search terms per retrieval run. Let us now consider how these 
last two parameters can be varied by the information center so as to 
lessen search times. 

The average number of input characters can be reduced by eliminating 
non-essential information from the data file. Here, document records are 
edited to conform to the needs and demands of center users. On a small 
machine with limited memory, this reformatting may require multiple passes.
Here, the output tape is used for storage of intermediate results which 
become the input to the next pass. 

The number of search terms can be reduced by examining the construction
of search strategies. For simplicity, let us assume Boolean logic.
From the equation

$$S = A \cdot (B + C) + D \cdot E$$

it can be seen that if S is to be true, then each of the following
abbreviated equations, $S_i$, must also be true.

$$S_1 = A + D \qquad S_2 = A + E$$

$$S_3 = B + C + D \qquad S_4 = B + C + E$$

If for a given document an $S_i$ is false, then S is also false. Therefore,
S need only be evaluated for those documents for which $S_i$ is true.
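
That each abbreviated equation is a necessary condition for S can be
checked by brute force over all 32 truth assignments of the five terms.
The following Python sketch is illustrative only: the function names are
hypothetical, and the four abbreviated forms are assumed to be those
obtained by taking one factor from each product term of S.

```python
from itertools import product

def full_strategy(a, b, c, d, e):
    """S = A * (B + C) + D * E, reading * as AND and + as OR."""
    return (a and (b or c)) or (d and e)

# Abbreviated strategies: one factor taken from each product term of S.
ABBREVIATED = {
    "S1": lambda a, b, c, d, e: a or d,
    "S2": lambda a, b, c, d, e: a or e,
    "S3": lambda a, b, c, d, e: b or c or d,
    "S4": lambda a, b, c, d, e: b or c or e,
}

def is_necessary(abbrev):
    """True if abbrev holds for every assignment that satisfies S."""
    return all(
        abbrev(*bits)
        for bits in product([False, True], repeat=5)
        if full_strategy(*bits)
    )

results = {name: is_necessary(fn) for name, fn in ABBREVIATED.items()}
print(results)
```

A single term such as A alone would fail this check, since a document
indexed only under D and E satisfies S without containing A.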

This approach leads to a three-step retrieval procedure. In the
first step, the complete file is searched using an abbreviated strategy, $S_i$.
Here, documents are read from the data file using one tape drive as input.
When a document is found that satisfies $S_i$, it is copied onto the other
tape. If a document does not satisfy $S_i$, it is skipped since it cannot
fulfill S. After the complete data file has been scanned, the output file
contains the subset of documents that satisfies $S_i$. During the second step,
this sub-file is searched using the complete strategy, S. When a document
is found that satisfies S, it is copied onto the output file along with a
tag that identifies its requestor. Finally, in the third step, search
results are printed. Here, documents are grouped together by requestor
by passing the print tape once for each request.
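
The three steps can be sketched in Python as follows. This is a minimal
in-memory illustration, not the tape-based program itself: lists stand in
for the tapes, each strategy is assumed to be a predicate on a document's
set of index terms, and the names (three_step_search, the requestor
"smith", the sample documents) are hypothetical.

```python
def three_step_search(data_file, requests):
    """data_file: iterable of (doc_id, set_of_index_terms).
    requests: dict mapping requestor -> (abbreviated, full) predicates."""
    # Step 1: one pass over the complete file. A document is copied to the
    # sub-file if it satisfies the abbreviated strategy of any request.
    sub_file = [
        (doc_id, terms)
        for doc_id, terms in data_file
        if any(abbrev(terms) for abbrev, _ in requests.values())
    ]
    # Step 2: the complete strategies are evaluated on the sub-file only.
    # Each hit is written with a tag identifying its requestor.
    print_tape = [
        (requestor, doc_id)
        for doc_id, terms in sub_file
        for requestor, (_, full) in requests.items()
        if full(terms)
    ]
    # Step 3: the print tape is passed once per requestor to group results.
    output = {
        requestor: [doc_id for tag, doc_id in print_tape if tag == requestor]
        for requestor in requests
    }
    return sub_file, output

docs = [(1, {"A", "B"}), (2, {"D", "E"}), (3, {"A"}), (4, {"C", "E"})]
requests = {
    "smith": (
        lambda t: "A" in t or "D" in t,  # abbreviated strategy A + D
        lambda t: ("A" in t and ("B" in t or "C" in t))
        or ("D" in t and "E" in t),      # complete strategy S
    ),
}
sub_file, output = three_step_search(docs, requests)
print([doc_id for doc_id, _ in sub_file], output)
```

In this small example the first step discards document 4, and the second
step discards document 3, which passed the abbreviated strategy but does
not satisfy S.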

Generally, run time for the first step is minimized by selecting the
abbreviated strategy which contains the fewest terms. For the S
shown above, this implies $S_1$ or $S_2$. Second-step run time is minimized by
selecting the abbreviated strategy which produces the smallest output during
the first step. Here, to make a final choice, we must know the relative
occurrences of terms A, D, and E. Lastly, print time during the third
step depends upon the number of documents retrieved by S in the second step.
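
One way to apply these two criteria is sketched below in Python. It is an
assumption-laden illustration: relative term occurrences are estimated
from a small sample of documents, each abbreviated strategy is represented
as a set of terms joined by OR, and the function name and sample data are
hypothetical.

```python
def choose_abbreviated(candidates, sample):
    """candidates: dict mapping name -> set of terms joined by OR.
    sample: list of index-term sets drawn from the data file.
    Prefers the fewest terms, then the fewest documents passed in the sample."""
    def hits(terms):
        # Number of sample documents sharing at least one term.
        return sum(1 for doc in sample if terms & doc)
    return min(candidates, key=lambda n: (len(candidates[n]), hits(candidates[n])))

candidates = {
    "S1": {"A", "D"},
    "S2": {"A", "E"},
    "S3": {"B", "C", "D"},
    "S4": {"B", "C", "E"},
}
sample = [{"A"}, {"D"}, {"D", "B"}, {"E"}, {"C"}]
best = choose_abbreviated(candidates, sample)
print(best)
```

With this sample, D occurs more often than E, so the two-term strategy
A + E passes fewer documents than A + D and is selected.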

If several searches are batched together, the abbreviated equation for
the first step is simply the sum (logical OR) of the individual abbreviated
equations. Alternatively, abbreviated strategies can be selected so as to
permit the strategy designer to iteratively partition the data base into
sub-files and to quickly examine sub-file content, as described in Wilde (3).
Abbreviated strategies can also be designed such that several
complete strategies can be run in parallel during the second step, as
suggested by Junkins and Schultz (4).

Search Times 


The equations in Appendix 1 predict that search time grows linearly 
as the number of search terms increases. This is confirmed by experimental 
data displayed in Figure 1. Here, run time per 10,000 documents is shown 
for three different input transmission rates as a function of the number 
of abbreviated strategy terms. These graphs can be used to estimate run 
times for different system configurations and data file sizes. For
example, an SDI search of 25 abbreviated terms on an update of 5,000
documents using a 15kc input drive takes approximately

$$\frac{5{,}000 \text{ doc.}}{10{,}000 \text{ doc.}} \times 4.26 \text{ min.} \approx 2.1 \text{ min.}$$

Similarly, a retrospective search of 10 abbreviated terms on a file of
650,000 documents using a 60kc drive takes approximately

$$\frac{650{,}000 \text{ doc.}}{10{,}000 \text{ doc.}} \times 1.14 \text{ min.} \approx 75 \text{ min.}$$
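
The scaling used in these two estimates can be written as a short Python
helper. The function name is hypothetical; the per-10,000-document times
(4.26 and 1.14 minutes) are the values quoted in the examples above.

```python
def scaled_run_time(n_docs, minutes_per_10k):
    """Scale a run time per 10,000 documents to a file of n_docs documents."""
    return n_docs / 10_000 * minutes_per_10k

sdi_minutes = scaled_run_time(5_000, 4.26)      # 15kc drive, 25 terms
retro_minutes = scaled_run_time(650_000, 1.14)  # 60kc drive, 10 terms
print(round(sdi_minutes, 2), round(retro_minutes, 1))
```

These evaluate to roughly 2.1 and 74 minutes respectively.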
Figure 1 - Run Time versus Search Terms (IBM 1130 - No overlap)



Search costs can be estimated by multiplying predicted run times 
by computer rates. Table 1 presents a set of monthly and hourly rates 
for a two-tape IBM 1130 system using the same three input speeds shown in 
Figure 1. (Hourly rates assume 176 hours/month.) Here, rates are given 
for a total system and for tapes alone. This latter figure would be 
appropriate if a non-tape system were already installed and if the extra 
tape costs were of interest. 

Using the same examples as above, the system cost for the SDI would
be approximately

$$\frac{\$17.90/\text{hr.}}{60 \text{ min./hr.}} \times 2.1 \text{ min.} \approx \$.65$$

while the tape-only cost for the retrospective would be approximately

$$\frac{\$10.94/\text{hr.}}{60 \text{ min./hr.}} \times 75 \text{ min.} \approx \$13.70$$
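
The cost arithmetic can be sketched in Python as follows. The function
names are hypothetical; the 176 hours per month and the two hourly rates
are the figures given in the text.

```python
HOURS_PER_MONTH = 176  # basis stated for the hourly rates in Table 1

def hourly_rate(monthly_rent):
    """Convert a monthly rental into an hourly rate."""
    return monthly_rent / HOURS_PER_MONTH

def run_cost(rate_per_hour, run_minutes):
    """Cost of a run of run_minutes at the given hourly rate."""
    return rate_per_hour / 60 * run_minutes

sdi_cost = run_cost(17.90, 2.1)    # total-system rate, SDI example
retro_cost = run_cost(10.94, 75)   # tape-only rate, retrospective example
print(round(sdi_cost, 2), round(retro_cost, 2))
```

Because the rental is a flat monthly charge, these per-run figures are
allocations of a fixed cost and fall as more runs share the machine.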


Summary 

If an information center is to be successful, it must be responsive 
to the demands of its users and clients. If a center has its own computer 
system, it can schedule batched runs, special runs, or evening runs in 
order to satisfy client demands, to meet higher priorities, or to 
overcome equipment failures. When a center has its own machine, it is 
paying a flat rental fee or fixed monthly amortization charge. Thus, 
additional computer use results in a lower per unit cost. Until recently, 
computer systems with good I/O were too expensive for most centers. 

Now, small/low cost machines are available that permit a center to consider 
acquiring its own dedicated computer system. 
Appendix 1 - Search Time Equations

Total run time, $T_{run}$, for a linear search is a function of total
central processor time, $T_{cpu}$, and total tape time, $T_{tape}$. If computer
operations are overlapped, $T_{run}$ equals the larger of $T_{cpu}$ and $T_{tape}$. If
they are not, $T_{run}$ equals the sum of $T_{cpu}$ and $T_{tape}$.

Total tape time, $T_{tape}$, is a function of total input time, $T_{input}$,
and total output time, $T_{output}$. Generally, since the number of retrieved
documents is much less than the number of data bank documents, $T_{output}$
is very small relative to $T_{input}$. Therefore, for simplicity, $T_{output}$
is ignored.

$T_{input}$ is the sum of the time spent starting the input tape drive,
$T_{start}$, the time transmitting information from the drive to memory, $T_{trans}$,
and the time stopping the drive, $T_{stop}$. Even the most elementary tape
drive sends an end-of-transmission signal at each inter-record gap; and
thus, there is no need to wait for the drive to stop. Therefore,

$$T_{tape} = T_{start} + T_{trans} \tag{1}$$

The total time spent starting the input drive is the product of the
time per start, $t_{start}$, and the number of starts during a search. If the
file being searched contains $N_{doc}$ documents and if the input tape is
blocked at an average of $n_{block}$ documents per block, then total start
time is

$$T_{start} = \frac{N_{doc}}{n_{block}} \cdot t_{start} \tag{2}$$
The total tape transmission time is the product of the number of
documents, $N_{doc}$, the average number of characters transmitted per
document, $n_{char}$, and the time to transmit each character, $t_{char}$.
Therefore, total transmission time is

$$T_{trans} = N_{doc} \cdot n_{char} \cdot t_{char} \tag{3}$$


Total tape time is then found by substituting (2) and (3) into (1)
producing

$$T_{tape} = N_{doc} \left( \frac{t_{start}}{n_{block}} + n_{char} \cdot t_{char} \right) \tag{4}$$
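
Equation (4) can be evaluated directly. The Python sketch below is
illustrative only: the function name is hypothetical, and the parameter
values (a 10 ms start time, 300 characters per document, a 15,000
character-per-second drive) are assumed for the example rather than taken
from measurements.

```python
def tape_time(n_doc, n_block, t_start, n_char, t_char):
    """Equation (4): total input tape time, in the units of t_start and t_char."""
    return n_doc * (t_start / n_block + n_char * t_char)

# Assumed values: 10,000 documents, 0.01 s per start, 300 characters per
# document, and 1/15,000 s per character.
unblocked = tape_time(10_000, 1, 0.01, 300, 1 / 15_000)
blocked = tape_time(10_000, 10, 0.01, 300, 1 / 15_000)
print(round(unblocked, 1), round(blocked, 1))
```

Under these assumptions, blocking ten documents per record reduces tape
time from about 300 seconds to about 210 seconds, since only the start
time term depends on the blocking factor.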


The total central processing time, $T_{cpu}$, can be expressed as the
product of the number of documents, $N_{doc}$, and the average time to process
each document, $t_{doc}$.

$$T_{cpu} = N_{doc} \cdot t_{doc} \tag{5}$$

Here, the average time to process each document is the product of
the average number of instructions to process that document, $n_{inst}$, times
the time to execute an average instruction, $t_{inst}$.

$$t_{doc} = n_{inst} \cdot t_{inst} \tag{6}$$


In processing a linear file, document terms must be compared against
question terms. The algorithm chosen to perform these comparisons must
be a function of the average number of index terms per document, $i_{doc}$,
and the number of retrieval search terms, $i_{search}$. If each document
term is compared against each search term, then the average number of
instructions to process each document is

$$n_{inst} = i_{doc} \cdot i_{search} \cdot n_{comp} + n_{house} \tag{7}$$

where $n_{comp}$ specifies the number of instructions to make a term
comparison while $n_{house}$ represents the instructions to perform housekeeping
functions. If both search and document terms have been previously sorted,
this expression can be improved to

$$n_{inst} = (i_{doc} + i_{search}) \cdot n_{comp} + n_{house} \tag{8}$$
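
The difference between the two comparison schemes can be illustrated in
Python. This is a sketch under the assumption that terms are plain
strings and that one loop iteration counts as one term comparison; the
function names and sample terms are hypothetical.

```python
def compare_all_pairs(doc_terms, search_terms):
    """Compare every document term with every search term."""
    comparisons, matches = 0, set()
    for d in doc_terms:
        for s in search_terms:
            comparisons += 1
            if d == s:
                matches.add(d)
    return matches, comparisons

def compare_sorted_merge(doc_terms, search_terms):
    """Merge two sorted term lists, advancing past the smaller term each time."""
    i = j = comparisons = 0
    matches = set()
    while i < len(doc_terms) and j < len(search_terms):
        comparisons += 1
        if doc_terms[i] == search_terms[j]:
            matches.add(doc_terms[i])
            i += 1
            j += 1
        elif doc_terms[i] < search_terms[j]:
            i += 1
        else:
            j += 1
    return matches, comparisons

doc = ["alloy", "creep", "steel", "weld"]   # sorted document terms
search = ["creep", "fatigue", "weld"]       # sorted search terms
pair_matches, pair_count = compare_all_pairs(doc, search)
merge_matches, merge_count = compare_sorted_merge(doc, search)
print(pair_count, merge_count)
```

Both functions find the same matching terms, but the all-pairs count is
the product of the two list lengths while the merge count never exceeds
their sum.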


Substituting (6) and (8) into (5), total CPU time becomes

$$T_{cpu} = N_{doc} \left[ (i_{doc} + i_{search}) \cdot n_{comp} + n_{house} \right] t_{inst} \tag{9}$$
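
Equation (9) and the rule for combining CPU and tape time can be sketched
in Python as follows. The function names are hypothetical, and the
parameter values (instruction counts, a 10 microsecond instruction time,
a 210 second tape time) are assumed for illustration only.

```python
def cpu_time(n_doc, i_doc, i_search, n_comp, n_house, t_inst):
    """Equation (9): total CPU time with sorted document and search terms."""
    return n_doc * ((i_doc + i_search) * n_comp + n_house) * t_inst

def run_time(t_cpu, t_tape, overlapped):
    """Larger of the two times if operations overlap, otherwise their sum."""
    return max(t_cpu, t_tape) if overlapped else t_cpu + t_tape

# Assumed values: 10,000 documents, 10 index terms per document, 25 search
# terms, 5 instructions per comparison, 50 housekeeping instructions, and
# 10 microseconds per instruction.
t_cpu = cpu_time(10_000, 10, 25, 5, 50, 10e-6)
t_tape = 210.0  # assumed tape time in seconds
print(round(t_cpu, 2), run_time(t_cpu, t_tape, False), run_time(t_cpu, t_tape, True))
```

With these assumed values the run is tape-bound, so overlapping saves
only the comparatively small CPU time.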


Examination of equations (4) and (9) shows that run time is a 
function of three independent sets of constraints. The first set is 
dictated by the data file and includes the number of data bank documents 
and the average number of index terms per document. The second set is 
determined by the computer and includes average instruction execution time, 
start time of the input tape drive, and time to transmit one input character 
from the drive to memory. The third set is specified by the information 
center and includes the average number of documents per input block, the 
average number of characters per input document, and the number of search 
terms per retrieval run. Here, the number of documents per block should be 
adjusted so as to produce a balanced run where tape time and CPU time are 
nearly equal. Balancing is discussed in Gildersleeve (5). 
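
As a rough illustration of balancing, equating the per-document terms of
(4) and (5) gives $t_{start}/n_{block} + n_{char} \cdot t_{char} = t_{doc}$, which
can be solved for $n_{block}$. The Python sketch below rests on that
derivation, which is an assumption made here and not a formula from the
text; the function name and the numeric values are hypothetical, and
memory limits on block size are ignored.

```python
def balanced_block_size(t_start, n_char, t_char, t_doc):
    """Blocking factor at which per-document tape time equals t_doc.
    Returns None when transmission time alone is at least t_doc, in
    which case no blocking factor can balance the run."""
    slack = t_doc - n_char * t_char
    if slack <= 0:
        return None
    return t_start / slack

# Assumed values: 0.01 s per start, 300 characters per document at
# 1/15,000 s per character, and 0.025 s of CPU time per document.
n_block = balanced_block_size(0.01, 300, 1 / 15_000, 0.025)
unbalanceable = balanced_block_size(0.01, 300, 1 / 15_000, 0.01)
print(n_block, unbalanceable)
```

Here about two documents per block balance the run; when the CPU time per
document is shorter than the transmission time, the run stays tape-bound
regardless of blocking.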